# Loading packages
library(readtext)
library(readxl)
library(dplyr)
library(tidyverse)
library(tidytext)
library(tm)
library(textstem)
library(wordcloud)
library(slam)
library(topicmodels)
library(SentimentAnalysis)
library(igraph)
library(ggraph)
library(widyr)
library(viridis)
library(plotly)
library(ggplot2)
library(lubridate)
library(wordcloud2)
library(stringr)
library(tidyr)
The main objective of this paper is to analyse the Federal Open Market Committee statements using text mining methods and tools provided by R. We start with a basic analysis of the length of each statement and then move on to word counts, sentiment and correlation analysis, and topic modelling.
The Federal Open Market Committee (FOMC) is a body of the central bank of the United States (the Federal Reserve System). Its main duty is setting the national monetary policy. The FOMC holds eight regularly scheduled meetings per year, typically in Washington, D.C., with additional meetings held as needed. At these meetings, the Committee reviews economic and financial conditions, determines the appropriate stance of monetary policy, and assesses the risks to its long-run goals of price stability and sustainable economic growth. The FOMC consists of 12 voting members: the seven members of the Board of Governors, the president of the Federal Reserve Bank of New York and four of the remaining 11 Reserve Bank presidents, who serve one-year terms on a rotating basis. All 12 Reserve Bank presidents attend FOMC meetings and participate in the discussions, but only the presidents who are Committee members at the time may vote on policy decisions.
The Committee releases a public statement immediately after each FOMC meeting. Each statement follows a very similar structure. First, the general background of the economic situation is presented. Then the Committee announces the established target for the federal funds rate and shares its predictions. At the end, the names of the members who voted for the FOMC monetary policy action are listed.
We sourced the data by scraping the statements from the official Federal Reserve website using Python. In the scraping algorithm we limited the content to the FOMC announcement itself, omitting the names of voters listed in the last paragraph. The analysed period covers the years 2006 to 2018, which yields 107 documents.
# Loading scraped statements
# DATA_DIR <- "C:/Users/KAndr/OneDrive/Studia/II rok I semestr/Text mining/Text mining project/Statements/"
# DATA_DIR <- "~/FOMC-text-mining/Statements"
DATA_DIR <- "~/Desktop/FOMC-text-mining/Statements"
fomc_2006 <- readtext(paste0(DATA_DIR, "/2006/*"))
fomc_2007 <- readtext(paste0(DATA_DIR, "/2007/*"))
fomc_2008 <- readtext(paste0(DATA_DIR, "/2008/*"))
fomc_2009 <- readtext(paste0(DATA_DIR, "/2009/*"))
fomc_2010 <- readtext(paste0(DATA_DIR, "/2010/*"))
fomc_2011 <- readtext(paste0(DATA_DIR, "/2011/*"))
fomc_2012 <- readtext(paste0(DATA_DIR, "/2012/*"))
fomc_2013 <- readtext(paste0(DATA_DIR, "/2013/*"))
fomc_2014 <- readtext(paste0(DATA_DIR, "/2014/*"))
fomc_2015 <- readtext(paste0(DATA_DIR, "/2015/*"))
fomc_2016 <- readtext(paste0(DATA_DIR, "/2016/*"))
fomc_2017 <- readtext(paste0(DATA_DIR, "/2017/*"))
fomc_2018 <- readtext(paste0(DATA_DIR, "/2018/*"))
# Binding data
statements <- rbind(fomc_2006, fomc_2007, fomc_2008, fomc_2009, fomc_2010, fomc_2011,
fomc_2012, fomc_2013, fomc_2014, fomc_2015, fomc_2016, fomc_2017, fomc_2018)
# Removing files from memory
remove(fomc_2006, fomc_2007, fomc_2008, fomc_2009, fomc_2010, fomc_2011,
fomc_2012, fomc_2013, fomc_2014, fomc_2015, fomc_2016, fomc_2017, fomc_2018)
We start our work on the statements with the initial preprocessing of the dataset. It consists of two columns: doc_id and text. doc_id is derived from each statement's website link, and text contains the content of the statement.
head(statements, 1)
## readtext object consisting of 1 document and 0 docvars.
## # Description: df[,2] [1 × 2]
## doc_id text
## * <chr> <chr>
## 1 20060131.txt "\"The Federa\"..."
# adding a unique ID
statements <- statements %>% mutate(ID = 1:n())
# setting column names
colnames(statements) <- c("Date", "Text", "ID")
# modification of doc_id column - changing it to date column
statements$Date <- gsub(".txt", "", statements$Date)
statements$Date <- as.Date(statements$Date, "%Y%m%d")
statements_all <- as.vector(statements$Text)
length(statements_all)
## [1] 107
The next step was converting the dataset into a volatile corpus (VCorpus), which is a convenient form for the following operations. Below, an example statement can be seen before any text preprocessing operations are applied.
(corpus_all <- VCorpus(VectorSource(statements_all)))
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 107
inspect(corpus_all[[1]])
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 778
##
## The Federal Open Market Committee decided today to raise its target for the federal funds rate by 25 basis points to 4-1/2 percent. Although recent economic data have been uneven, the expansion in economic activity appears solid. Core inflation has stayed relatively low in recent months and longer-term inflation expectations remain contained. Nevertheless, possible increases in resource utilization as well as elevated energy prices have the potential to add to inflation pressures. The Committee judges that some further policy firming may be needed to keep the risks to the attainment of both sustainable economic growth and price stability roughly in balance. In any event, the Committee will respond to changes in economic prospects as needed to foster these objectives.
We start preprocessing with text cleaning using the tm_map() function. We lowercase the text, remove words from the built-in stopword list, remove punctuation, unnecessary whitespace and numbers. At the end we apply the PlainTextDocument() function.
corpus_clean <- corpus_all %>%
tm_map(tolower) %>%
tm_map(removeWords, stopwords("en")) %>%
tm_map(removePunctuation) %>%
tm_map(stripWhitespace) %>%
tm_map(removeNumbers) %>%
tm_map(PlainTextDocument)
Below, an example statement can be seen after the above cleaning steps.
as.character(corpus_clean[[1]])
## [1] " federal open market committee decided today raise target federal funds rate basis points percent although recent economic data uneven expansion economic activity appears solid core inflation stayed relatively low recent months longerterm inflation expectations remain contained nevertheless possible increases resource utilization well elevated energy prices potential add inflation pressures committee judges policy firming may needed keep risks attainment sustainable economic growth price stability roughly balance event committee will respond changes economic prospects needed foster objectives "
In order to ease operations on the corpus, we modify it into a data frame.
df_corpus <- data.frame(text = unlist(sapply(corpus_clean, `[`, "content")), stringsAsFactors = F)
df_corpus <- df_corpus %>% mutate(doc_id = 1:n())
df_corpus$text[1]
## [1] " federal open market committee decided today raise target federal funds rate basis points percent although recent economic data uneven expansion economic activity appears solid core inflation stayed relatively low recent months longerterm inflation expectations remain contained nevertheless possible increases resource utilization well elevated energy prices potential add inflation pressures committee judges policy firming may needed keep risks attainment sustainable economic growth price stability roughly balance event committee will respond changes economic prospects needed foster objectives "
In the next step, we add a column with the cleaned text to the statements data frame. Another important operation is text lemmatization, which reduces each word to its grammatically correct base form. An example result is printed below.
statements_clean <- statements %>%
mutate(cleaned_text = df_corpus$text)
statements_clean$cleaned_text <- lemmatize_strings(statements_clean$cleaned_text)
statements_clean$cleaned_text[1]
## [1] "federal open market committee decide today raise target federal fund rate basis point percent although recent economic datum uneven expansion economic activity appear solid core inflation stay relatively low recent month longerterm inflation expectation remain contain nevertheless possible increase resource utilization good elevate energy price potential add inflation pressure committee judge policy firm may need keep risk attainment sustainable economic growth price stability roughly balance event committee will respond change economic prospect need foster objective"
At this stage, we also count the number of words occurring in the original statement and in the cleaned statement, so that we can analyse the trend in statement length over time.
count_cleaned_word <- statements_clean %>%
unnest_tokens(word_count, cleaned_text) %>%
count(ID, word_count, sort = T) %>%
group_by(ID) %>%
summarize(word_cleaned_count = sum(n))
statements_clean_count <- left_join(statements_clean, count_cleaned_word, by = 'ID')
count_word <- statements_clean_count %>%
unnest_tokens(word_count, Text) %>%
count(ID, word_count, sort = T) %>%
group_by(ID) %>%
summarize (word_count = sum(n))
statements_final <- left_join(statements_clean_count, count_word, by = 'ID')
The plot below shows line plots of word counts over time, one for the statements before and one after cleaning. The number of words increased over time for both versions until 2014, when the length of the statements started to decrease. The ratio of cleaned words to total words is fairly constant and oscillates around 60%.
We tried to relate the changes in statement length to external events and discovered that they coincide with changes of the Fed Chair. Until 2014 the Chair was Ben Bernanke, after which Janet Yellen took over the role. Since February 2018, the Fed Chair has been Jerome Powell.
myplot <- statements_final %>%
select(Date, word_count, word_cleaned_count) %>%
ggplot() +
geom_line(aes(x = Date,
y = word_count),
color = viridis(10)[3]) +
geom_line(aes(x = Date,
y = word_cleaned_count),
color = viridis(10)[6]) +
labs(x = "Date",
y = "Number of words",
title = "Comparison of number of words between original and cleaned <br>statements content over time") +
scale_x_date(date_breaks = "1 year",
date_labels = "%Y") +
theme_minimal()
ggplotly(myplot)
Zipf's law, in the context of text mining, states that the frequency of a word is inversely proportional to its rank in the frequency table: the n-th most frequent word occurs roughly 1/n times as often as the most frequent one. We decided to check the law empirically on our dataset. Below is the word-count table with the theoretical Zipf frequency added.
word_counts_zipf <- statements_clean_count %>%
mutate(year = year(Date)) %>%
unnest_tokens(word_count, cleaned_text) %>%
count(word_count, sort = T)
word_count <- word_counts_zipf
colnames(word_count) <- c("word", "count")
word_count <- word_count %>%
mutate(word = factor(word, levels = word),
rank = row_number(),
zipfs_freq = ifelse(rank == 1, count, dplyr::first(count) / rank))
word_count
## # A tibble: 860 x 4
## word count rank zipfs_freq
## <fct> <int> <int> <dbl>
## 1 committee 1070 1 1070
## 2 inflation 839 2 535
## 3 will 638 3 357.
## 4 economic 556 4 268.
## 5 market 554 5 214
## 6 rate 476 6 178.
## 7 federal 445 7 153.
## 8 continue 364 8 134.
## 9 remain 339 9 119.
## 10 labor 331 10 107
## # … with 850 more rows
Additionally, we present the relationship on a point plot. Based on this visualisation we decided to remove the most and least frequent words, setting the cutoffs at ranks 17 and 300 and keeping only the words in between.
p1 <- ggplot(word_count,
aes(x = rank, y = count,
color = rank,
text = paste("Word: ", word,
"<br>Frequency of word: ", count,
"<br>Rank: ", rank))) +
geom_point() +
labs(x = "Rank", y = "count", title = "Zipf's law visualization") +
scale_color_viridis_c() +
geom_vline(xintercept = 17) +
geom_vline(xintercept = 300) +
theme_minimal() +
theme(legend.position = "none")
ggplotly(p1, tooltip = "text")
large_zipf <- as.vector(word_count$word[1:17])
small_zipf <- as.vector(word_count$word[300:nrow(word_count)])
corpus_clean <- corpus_clean %>% tm_map(removeWords, large_zipf)
corpus_clean <- corpus_clean %>% tm_map(removeWords, small_zipf)
df_corpus <- data.frame(text = unlist(sapply(corpus_clean, `[`, "content")), stringsAsFactors = F)
df_corpus <- df_corpus %>% mutate(doc_id = 1:n())
statements_clean <- statements %>%
mutate(cleaned_text = df_corpus$text)
statements_clean$cleaned_text <- lemmatize_strings(statements_clean$cleaned_text)
Using the term frequency-inverse document frequency (TF-IDF) statistic, we proceeded to analyse the statements further and find the words that carry more information than the others. It turned out that the three highest TF-IDF values belong to the words bank, central and arrangement, all in statement 38.
statements_words <- statements_clean %>%
mutate(year = year(Date)) %>%
unnest_tokens(word_count, cleaned_text) %>%
count(ID, year, word_count, sort = T)
statements_words_id <- statements_words %>%
bind_tf_idf(word_count, ID, n) %>%
arrange(desc(tf_idf))
statements_words_id
## # A tibble: 12,669 x 7
## ID year word_count n tf idf tf_idf
## <int> <dbl> <chr> <int> <dbl> <dbl> <dbl>
## 1 38 2010 bank 20 0.241 2.27 0.548
## 2 38 2010 central 7 0.0843 2.88 0.243
## 3 38 2010 arrangement 5 0.0602 3.29 0.198
## 4 14 2007 reserve 3 0.1 1.90 0.190
## 5 19 2008 bank 11 0.0775 2.27 0.176
## 6 14 2007 dislocation 1 0.0333 4.67 0.156
## 7 19 2008 central 7 0.0493 2.88 0.142
## 8 14 2007 operation 1 0.0333 3.98 0.133
## 9 14 2007 trade 1 0.0333 3.98 0.133
## 10 25 2008 bank 6 0.0541 2.27 0.123
## # … with 12,659 more rows
Using the computed statistic, we visualised the most important words per year. It is interesting to see the word balanceconsistent in two years (2017 and 2018). Looking at the raw data, we can see that it arises from a missing space after a full stop, which was then collapsed when we removed punctuation during preprocessing. This is clear evidence that parts of the statements are copy-pasted from previous years, without even fixing earlier mistakes; a quick check of the raw text is sketched after the plot below.
statements_words_year <- statements_words %>%
group_by(year, word_count) %>%
summarize(n = sum(n)) %>%
bind_tf_idf(word_count, year, n) %>%
arrange(desc(tf_idf)) %>%
ungroup()
pd <- statements_words_year %>%
arrange(desc(tf_idf)) %>%
mutate(word = factor(word_count, levels = rev(unique(word_count)))) %>%
group_by(year) %>%
top_n(10) %>%
ungroup() %>%
arrange(year, tf_idf) %>%
mutate(order = row_number())
ggplot(pd, aes(order, tf_idf, fill = tf_idf)) +
geom_bar(show.legend = FALSE, stat = "identity") +
labs(x = NULL, y = "TF-IDF value") +
facet_wrap(~year, ncol = 2, scales = "free") +
scale_x_continuous(breaks = pd$order, labels = pd$word, expand = c(0,0)) +
scale_y_continuous(expand = c(0,0)) +
coord_flip() +
theme_minimal() +
scale_fill_viridis_c(direction=-1)
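As a quick, illustrative check of the copy-paste observation above, one could search the original (uncleaned) text for the pattern "balance.Consistent" written without a space after the full stop, which is what collapses into the single token balanceconsistent after lowercasing and punctuation removal. The exact pattern is our assumption based on the merged token; a sketch:
# Hypothetical check: which statements contain "balance.Consistent" with no space after the dot?
statements_final %>%
filter(str_detect(Text, fixed("balance.Consistent"))) %>%
pull(Date)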
We also present the frequency of words using wordclouds. The bigger and lighter the word, the higher its frequency. This is much more visual and pleasant to interpret than a table. Based on the frequency table and the cloud below, the most frequent remaining words are conditions, securities, funds, financial and range.
dtm <- TermDocumentMatrix(corpus_clean)
m <- as.matrix(dtm)
v <- sort(rowSums(m), decreasing=TRUE)
d <- data.frame(word = names(v), freq=v)
head(d, 10)
## word freq
## conditions conditions 327
## securities securities 315
## funds funds 279
## financial financial 213
## range range 206
## agency agency 203
## employment employment 203
## pace pace 199
## information information 185
## growth growth 180
set.seed(1234)
wordcloud2(d %>% arrange(desc(freq)) %>% head(100), color=viridis(100, direction = -1), shape='circle', size=0.2, minRotation = -pi/2, ellipticity = .8)
Another analysis we conducted is sentiment extraction for each statement. We decided to use the Loughran-McDonald financial dictionary, which provides positive and negative sentiment categories. This dictionary was first presented in the Journal of Finance and has been widely used in the finance domain ever since.
tidy_statement <- statements_clean %>%
mutate(year = year(Date)) %>%
unnest_tokens(word, cleaned_text)
tidy_statement <- tidy_statement %>%
select(year, ID, word)
LM_dict <- loadDictionaryLM()
LM_dict_pos <- as.data.frame(LM_dict$positiveWords)
LM_dict_pos$sentiment <- c("positive")
colnames(LM_dict_pos)[1] <- "word"
LM_dict_neg <- as.data.frame(LM_dict$negativeWords)
LM_dict_neg$sentiment <- c("negative")
colnames(LM_dict_neg)[1] <- "word"
LM_dict <- rbind(LM_dict_pos, LM_dict_neg)
First, we check which positive and negative words are the most frequent in the FOMC statements. Below are the heads of tables listing the five most frequent positive and negative words.
statement_word_counts_pos <- tidy_statement %>%
inner_join(LM_dict) %>%
count(word, sentiment, sort = TRUE) %>%
ungroup() %>%
filter(sentiment == "positive")
head(statement_word_counts_pos, 5)
## # A tibble: 5 x 3
## word sentiment n
## <chr> <chr> <int>
## 1 strengthen positive 85
## 2 progress positive 80
## 3 strong positive 80
## 4 good positive 67
## 5 effect positive 41
statement_word_counts_neg <- tidy_statement %>%
inner_join(LM_dict) %>%
count(word, sentiment, sort = TRUE) %>%
ungroup() %>%
filter(sentiment == "negative")
head(statement_word_counts_neg, 5)
## # A tibble: 5 x 3
## word sentiment n
## <chr> <chr> <int>
## 1 slow negative 72
## 2 cost negative 31
## 3 diminish negative 28
## 4 weak negative 28
## 5 depress negative 22
As the final sentiment score, we calculate the difference between the number of positive and negative words in each statement. We present the results for each statement on the plot below, grouped by year. An interesting pattern can be seen from 2008 to 2012, when the sentiment suddenly turned negative; this can be linked to the 2008 financial crisis. After that the sentiment became more positive and kept increasing each year.
statement_sentiment <- tidy_statement %>%
inner_join(LM_dict) %>%
count(year, ID, sentiment) %>%
spread(sentiment, n, fill=0) %>%
mutate(sentiment = positive - negative)
sentiment_plot <- ggplot(statement_sentiment, aes(as.factor(ID),
sentiment,
fill = sentiment,
text = paste("Statement id: ", ID,
"<br>Sentiment value: ", sentiment))) +
geom_col(show.legend = FALSE) +
facet_wrap(~year, ncol = 3, scales = "free_x") +
labs(x="", y="Sentiment") +
scale_fill_viridis_c(direction=-1) +
theme_minimal()
ggplotly(sentiment_plot, tooltip = "text") %>%
layout(autosize = F, width = 600, height = 800)
Topic modelling is an unsupervised method for identifying topics in documents, and LDA (Latent Dirichlet Allocation) is a popular approach to it. It assumes that every document is a mixture of a few topics and that every topic is a mixture of words. Importantly, a single word can belong to several topics, so topics can be somewhat similar to each other.
In LDA, we have to specify the number of topics we want to obtain. In addition to that, we can adjust two hyperparameters, alpha and beta, which correspond to document-topic density and topic-word density respectively.
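The chunk that fits the model itself is not reproduced in this section. A minimal sketch of how the lda_topics object used below could be obtained is shown here, assuming a document-term matrix cast from the statements_words counts computed earlier, k = 5 topics (as stated below the plot) and an arbitrary seed; the exact call we used may differ.
# Sketch (assumptions: DTM cast from statements_words, k = 5, arbitrary seed)
dtm_lda <- statements_words %>%
cast_dtm(ID, word_count, n)
statements_lda <- LDA(dtm_lda, k = 5, control = list(seed = 1234))
# per-topic word probabilities (beta), as used in the plotting code below
lda_topics <- tidy(statements_lda, matrix = "beta")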
lda_top_terms <- lda_topics %>%
group_by(topic) %>%
top_n(8, beta) %>%
ungroup() %>%
arrange(topic, -beta)
lda_top_terms %>%
mutate(term = reorder_within(term, beta, topic)) %>%
ggplot(aes(term, beta, fill = factor(topic))) +
geom_col(show.legend = FALSE, fill = viridis(40)) +
facet_wrap(~ topic, scales = "free") +
coord_flip() +
labs(y="Value of beta parameter", x="") +
scale_x_reordered() +
theme_minimal()
In our case we chose 5 topics, and based on the graphs it is visible that topics one, three and five may be similar due to the word "securities".
It is also possible to see how topics change over time. In our analysis it turns out that every statement is a mixture of all 5 topics in similar proportions; a sketch of how the per-document topic proportions could be derived is shown below.
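The statements_topics data frame used in the plot below (columns topic1-topic5 and Date) is not constructed in this section; a possible sketch, assuming the statements_lda model from the sketch above, is:
# Sketch: per-document topic proportions (gamma), one column per topic, joined with statement dates
statements_topics <- tidy(statements_lda, matrix = "gamma") %>%
mutate(topic = paste0("topic", topic)) %>%
spread(topic, gamma) %>%
mutate(ID = as.integer(document)) %>%
left_join(statements_final %>% select(ID, Date), by = "ID") %>%
arrange(Date)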
colors_viridis <- viridis(5)
topic_density <- ggplot(statements_topics, aes(x=Date)) +
geom_line(aes(y=topic1), color=colors_viridis[1]) +
geom_line(aes(y=topic2), color=colors_viridis[2]) +
geom_line(aes(y=topic3), color=colors_viridis[3]) +
geom_line(aes(y=topic4), color=colors_viridis[4]) +
geom_line(aes(y=topic5), color=colors_viridis[5]) +
scale_x_date(date_breaks = "1 year", date_labels = "%Y") +
theme_minimal()
ggplotly(topic_density)
Finally, we were curious which words are the most correlated. To explore this, we build a graph of words in which two words are connected if their correlation across statements is above a given threshold, in our case 65%. It turns out that we get one large connected component and a few isolated pairs of words.
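The by_statement_word data frame used below is not constructed in this section; a sketch of how it could be built is one row per statement ID and word, tokenised from the cleaned, lemmatised text:
# Sketch: one row per statement ID and word from the lemmatised text
by_statement_word <- statements_clean %>%
unnest_tokens(word, cleaned_text) %>%
select(ID, word)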
by_statement_word <- data.frame(lapply(by_statement_word, as.character), stringsAsFactors=FALSE)
# count words co-occurring within statements
word_pairs <- by_statement_word %>%
pairwise_count(word, ID, sort = TRUE)
word_pairs
## # A tibble: 128,094 x 3
## item1 item2 n
## <chr> <chr> <dbl>
## 1 fund open 106
## 2 open fund 106
## 3 target open 104
## 4 open target 104
## 5 fund target 104
## 6 target fund 104
## 7 time open 98
## 8 open time 98
## 9 outlook open 97
## 10 information open 97
## # … with 128,084 more rows
word_cors <- by_statement_word %>%
group_by(word) %>%
filter(n() >= 107) %>%
pairwise_cor(word, ID, sort = TRUE)
word_cors <- word_cors %>% filter(correlation<=1)
word_cors %>%
filter(correlation > .65) %>%
graph_from_data_frame() %>%
ggraph(layout = "fr") +
geom_edge_link(aes(edge_alpha = correlation), show.legend = FALSE) +
geom_node_point(color = "lightblue", size = 5) +
geom_node_label(aes(label = name), repel = TRUE) +
theme_void()